Abstract: Thousands of different forms (words) are associated with thousands
of different meanings (concepts) in a language computer model. Reasonable
agreement with reality is found for the number of languages in a family and
the Hamming distances between languages.

1 Introduction 

The competition between languages of adult people [1] has for several years
been intensively simulated on computers [2][3][4] or treated mathematically.
When language structures were studied, they usually consisted of about a dozen
features, often binary [5][6][7]; see [8] for a review. This number corresponds
roughly to the 47 statistically independent language features [9] in the World
Atlas of Language Structures [10], which relate to phonology, morphology,
and syntax. In contrast, thousands of words are needed in everyday life for
thousands of different concepts, not counting special terms, e.g. from the
sciences.

While the origin of words has already been simulated [11], we want to 
simulate the subsequent proliferation and competition between thousands of 
languages, each containing thousands of forms for thousands of meanings. In 
addition we try to get realistic statistics for the number of language families
containing a given number of languages, and for the similarity of languages 
within one family and between different families. 

Histogram: F=Q=7000, tau=40, t=60, split=15%
Figure 1: Family size distributions. The symbols connected with lines
correspond to the parameter settings in the headline, while those not connected
with lines have F = Q = 2000 and t = 100. In both cases three samples are
shown, differing only in the random numbers. The slope of the straight line
corresponds to the empirical power law of [14]; see also [15].

In the present paper we regard grammatically related words (e.g., life, live,
lives, lived, living) as one "form", and denote similarly related concepts by
one "meaning". In the terminology of linguistics this corresponds to looking
only at lexical morphemes, ignoring various inflections and derivations. Thus
our N languages each consist of F meanings and Q forms; each meaning
i = 1, 2, ..., F is expressed by one form S_i = 1, 2, ..., Q. One form may be
associated with several meanings, but no meaning is associated with several forms.
In reality the latter case, called homophony by linguists, does occur, but is 
somewhat rarer than the former case, termed polysemy. 
Within (lower data) and between (higher data) families
Figure 2: Evolution of Manhattan Hamming distances for the simulations of
Fig. 1. The simulations up to 60 iterations refer to F = Q = 7000, those for
t < 100 to F = Q = 2000. See Fig. 14 in [16] for similar results.



The simulations allow for cases where a given meaning is not realised in a 
given language, taking into account the sensitivity of the lexical inventories 
of languages to differences in cultural and natural environments. Such an 
unrealised meaning could be denoted by S_i = Q.

We start with one language and one form, where all meanings have the
central form S_i = Q/2; thus both the initial evolution of languages and their
later competition are simulated. Then we apply three processes: Change
("mutation") and diffusion ("transfer") of single features S_i as in the Schulze
model [8], plus splitting [12] and merging of whole languages. In this last
(new?) process, two languages which agree in all their S_i at one time are
regarded as one language from then on, changing, diffusing, and splitting
together, and potentially undergoing further merging with other languages.
The real-world parallel to merging would be cases where incipient differences 
disappear shortly after they arise, something that happens when children 
F=Q=7000, Lmax=70000, f=0.1, split=10(+), 15(x), 20(*)%
Figure 3: Top: Three simulations with the modified definition of "family";
the line corresponds to reality [14]. Bottom: Sample-to-sample fluctuations
when only the random number seed is changed.



change "wrong" forms popular among their peers to grown-up "correct" 
forms, when slang forms are invented and later forgotten again, when in- 
group varieties emerge and disappear, or when speakers of dialects shift to 
the standard variety. Unlike the Schulze model, and more like
the Viviane and Tuncay models [121 112], we no longer simulate each
individual but only the language as a whole. Thus the "population" for one
language no longer is part of this model, and therefore, in contrast to the 
Schulze model, we have no shift from languages spoken by few people to more 
widespread languages, only merging of similar variants, as mentioned above. 
And we cannot determine a language size distribution, only a distribution 
for the number of languages within one language family. Otherwise the new 
Histogram of yes-no distances, same simulations
Figure 4: Distribution of Hamming distances in the simulations of Fig. 3 top.
See Fig. 14 in [16] for similar results.

model is similar to the Schulze model. 

In the next section we define the parameters of this model, then present 
our results, and in section 4 offer some modifications.

2 Model 

A "language" is defined by a string of F forms S_i, one for each meaning i
between 1 and F, where S_i is an integer between 1 and Q. Thus Q^F different
languages are possible. At each iteration t = 1, 2, ... we go in the same order
through all N(t) different languages existing at that time t. Each of the F
language features at each iteration undergoes with probability p a change,
which means that with probability q it takes over the corresponding S_i of a
randomly selected language then existing in the model, and with probability
1 − q it changes its own S_i by ±1 (but not below 1 or above Q). Also, at each
iteration all language pairs agreeing in all their corresponding S_i are
merged into one




F=Q=7000, Lmax=70,000, split=f=10%
Figure 5: Time variation of the number L of languages in one of the samples
of Fig. 3 top.



language (N → N − 1), and all languages surviving this merging split with
probability s into two languages (N → N + 1) which from then on may
diverge through change (p) and diffusion (q).
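
The update rules of this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the Fortran program mentioned below: the name `step`, the in-place sequential update, the clamping at the boundaries 1 and Q, and the possibility that a language selects itself as donor are all choices made here for illustration.

```python
import random


def step(languages, Q, p, q, s, rng, merging=True):
    """Apply one iteration to a list of languages (lists of ints in 1..Q).

    Hypothetical helper: the order change/diffusion -> merging -> splitting
    follows the text; the details marked "assumption" are not from it.
    """
    for lang in languages:                      # same order at every iteration
        for i in range(len(lang)):
            if rng.random() >= p:
                continue                        # feature i stays unchanged
            if rng.random() < q:
                # diffusion: copy feature i from a randomly selected language
                # (assumption: the language itself may be selected)
                lang[i] = rng.choice(languages)[i]
            else:
                # change: move by +1 or -1, clamped to 1..Q (assumption)
                lang[i] = min(Q, max(1, lang[i] + rng.choice((-1, 1))))
    if merging:
        # merging: languages agreeing in all S_i count as one (N -> N - 1)
        unique = {}
        for lang in languages:
            unique.setdefault(tuple(lang), lang)
        languages = list(unique.values())
    # splitting: each survivor produces a copy with probability s (N -> N + 1)
    offspring = [lang[:] for lang in languages if rng.random() < s]
    return languages + offspring


rng = random.Random(1)
F = Q = 20                                      # tiny sizes, for illustration
langs = [[Q // 2] * F]                          # one language, central form
for t in range(10):
    # merging is switched off during an initial lag (here 5 iterations)
    langs = step(langs, Q, p=0.05, q=0.9, s=0.15, rng=rng, merging=(t >= 5))
print(len(langs), "languages after 10 iterations")
```

Storing each language as a plain list keeps the merging test simple: two languages are merged exactly when their tuples of forms are equal.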

We start with N = 1 language and during the first τ iterations switch
off the merging process (since otherwise we would always stay at one, albeit
changing, language). At the end of this time lag, t = τ, each language is
regarded as the founder of one language family, comprising e.g. all
Indo-European languages. The "size" of a family is the number of different
languages in it, arising from the later splitting process.

In this sense the model combines language evolution and language com- 
petition. A Fortran program is available upon request. 
p=0.010; q=0.1(+), 0.3(x), 0.5(*), 0.7(sq.), 0.9(full sq.)
Figure 6: Modified family size distributions for various diffusion probabilities
q, with p = 0.01 (top) and 0.001 (bottom); F = Q = 7000. The lines again
indicate reality [14].

3 Results 

Mostly we worked with 7000 meanings and 7000 forms, at p = 0.001, q =
0.9, s = 0.15, τ = 40, 10^ iterations. Figure 1 shows the size distribution
for the families, which seems to be log-normal (parabola in this log-log plot). 
The plot is based on about 6000 languages in about 500 language families,
which are realistic numbers [14] for the number of languages in the present
world. But since the distribution is more log-normal than power-law, the
largest real families with about 1000 languages are missing in this model.

The Manhattan Hamming distance is the sum over all absolute differences
between the corresponding S_i of the two languages. Figure 2 gives the time
p = 0.010, q = 0.1, 0.3, 0.5, 0.7, 0.9 (right to left)
Figure 7: Modified yes-no Hamming distance distributions for various
diffusion probabilities q, with p = 0.01, F = Q = 7000.

dependence of the average Manhattan Hamming distances within one family
and between different families; the second one is about twice as large as 
the first one, which seems reasonable [17]. All our simulations are non- 
equilibrium ones except for the number of families: The number of languages 
and the Hamming distances increase with time. 
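
The distance measures used in the figures can be written down directly. The following is a minimal sketch: the Manhattan variant follows the definition above, the yes-no variant counts only whether two forms differ, and the helper `mean_within_between` together with the toy data and family labels are hypothetical names and values introduced here for illustration.

```python
from itertools import combinations


def manhattan_distance(a, b):
    """Sum of absolute differences between corresponding S_i."""
    return sum(abs(x - y) for x, y in zip(a, b))


def yes_no_distance(a, b):
    """Number of meanings whose forms differ, ignoring by how much."""
    return sum(x != y for x, y in zip(a, b))


def mean_within_between(languages, family, dist):
    """Average distance over pairs inside one family and across families."""
    within, between = [], []
    for i, j in combinations(range(len(languages)), 2):
        d = dist(languages[i], languages[j])
        (within if family[i] == family[j] else between).append(d)
    mean = lambda v: sum(v) / len(v) if v else float("nan")
    return mean(within), mean(between)


languages = [[1, 1], [1, 2], [5, 5]]   # toy data, not simulation output
family = [0, 0, 1]                      # hypothetical family labels
print(mean_within_between(languages, family, manhattan_distance))  # (1.0, 7.5)
```

Comparing all pairs costs of order N^2 F operations, so for thousands of languages one would probably sample pairs instead; that shortcut is not part of the sketch.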

4 Modifications 

A better result for the family size distribution, but a worse one for the
Hamming distances, is obtained by a different family definition: whenever a new
language is formed due to the splitting process, with probability f = 0.1 it
is regarded as the founder of a new language family, with all later offspring
languages belonging to it as long as they do not found a new family. Then
we see in Fig. 3 a better straight line (power law), extending to larger family
sizes and agreeing with reality [14]. The distributions of Hamming distances
look as before, Fig. 4, but now due to the random definition of "family" there
is no major difference between Hamming distances within one and between
different language families. In these simulations we adopted a Verhulst
extinction probability L/Lmax for each language at each iteration, where L is
the current number of languages and Lmax = 70,000. As a result the number
of languages may stabilise as in Fig. 5, even though the Hamming distances
still increase; the number of families is about 100. Fig. 6 shows that
variations in the change and diffusion probabilities p, q give roughly the same
family size distributions, and Fig. 7 shows the variation of the Hamming
distance histogram. (The number of families with only a few languages is lower
in the simulations of Figs. 4 and 7 than in reality; perhaps the simulations
underestimate the extinction of languages.)
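
The modified family definition and the Verhulst extinction can be illustrated by tracking family labels only. This is a sketch under assumptions: the function names are hypothetical, the forms S_i are left out for brevity, and applying extinction after splitting within one iteration is a choice made here, since the text does not fix the order.

```python
import random
from collections import Counter


def split_with_founders(family, s, f, rng):
    """Return the family labels after one splitting step.

    family[k] is the family label of language k. Each language splits with
    probability s; the offspring founds a new family with probability f and
    otherwise inherits the label of its parent.
    """
    next_label = max(family, default=-1) + 1
    offspring = []
    for label in family:
        if rng.random() < s:
            if rng.random() < f:
                offspring.append(next_label)    # founder of a new family
                next_label += 1
            else:
                offspring.append(label)         # stays in the parent's family
    return family + offspring


def verhulst_survivors(family, L_max, rng):
    """Remove each language with probability L / L_max, L = current number."""
    L = len(family)
    return [label for label in family if rng.random() >= L / L_max]


def family_size_histogram(family):
    """Map family size -> number of families of that size."""
    sizes = Counter(family)                     # label -> number of languages
    return Counter(sizes.values())


rng = random.Random(2)
family = [0]
for _ in range(30):                             # toy run with a small L_max
    family = split_with_founders(family, s=0.5, f=0.1, rng=rng)
    family = verhulst_survivors(family, L_max=200, rng=rng)
print(sorted(family_size_histogram(family).items()))
```

Because the extinction probability grows with L, the number of languages in such a run tends to level off well below L_max, which is the stabilisation mechanism described above.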

The merging process was introduced to achieve, without our Verhulst 
extinction, a stationary number of languages at long t and very small F and 
Q. For the large F, Q and small t used here this aim is not achievable. The 
family sizes without merging look similar (not shown). For small F = Q =
100, 200, 500 we ran up to a million iterations and then saw how the average
Manhattan Hamming distance stabilises. Its yes-no variant, counting only
the number of different forms and not the amount of their difference, gets
close to its maximum value for increasing time (with extinction and
merging), Fig. 8.

In Fig. 9 we compare simulation results with empirical data from 859 
languages collected as part of a project on automated lexicostatistics. For 
each language, a 40-word subset [18] of the Swadesh 100-word basic
vocabulary list [19] was transcribed in a standard orthography [20]. The distance
between any two languages was defined as the percentage of attested words 
on the list that fail to match according to objective rules ^U\. The data 
are well fit by the simulation with a low diffusion probability, q = 0.1: real 
and simulated languages differ typically in about 97% of their words. (Only 
near 100 % simulations and reality differ, since very few simulated languages 
differ by 100 %. With much smaller F = Q = 50 this peak does not shift; 
with q = 0.01 the peak shifts to 98 %. Thus this discrepancy does not go 
away.) This result suggests that borrowing is not the source of most changes 
in basic vocabulary. 
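
The empirical distance described in this paragraph, the percentage of attested words that fail to match, can be sketched as follows. The objective matching rules of the lexicostatistics project are not reproduced here; exact string equality is used as a stand-in, and the function name, the `None` convention for unattested words, and the toy word lists are assumptions of this sketch.

```python
def lexical_distance(words_a, words_b, match=None):
    """Percentage of attested word pairs that fail to match.

    words_a, words_b: equally long lists, one entry per meaning on the list;
    None marks a word that is not attested in that language.
    match: hypothetical matching rule; defaults to exact string equality.
    """
    if match is None:
        match = lambda x, y: x == y
    attested = [(x, y) for x, y in zip(words_a, words_b)
                if x is not None and y is not None]
    if not attested:
        raise ValueError("no meaning is attested in both languages")
    failures = sum(not match(x, y) for x, y in attested)
    return 100.0 * failures / len(attested)


# Toy four-word lists (invented strings, not real data)
a = ["nas", "kulo", None, "mit"]
b = ["nas", "kila", "tep", "wen"]
print(round(lexical_distance(a, b), 1))   # 66.7
```

For simulated languages the same function applies with integer forms S_i in place of strings, in which case it reduces to the yes-no Hamming distance expressed as a percentage.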

The greater variability of the data relative to the simulation may be 
attributed to two factors: first, random sampling variability in the data, 
which are based on a 40-word sample from a much larger true number of 
words; and second, the fact that the likelihood of words matching in the data
varies with the length of the words, while the simulation treats all words
equally.
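
The size of the first effect can be estimated roughly. Assuming, purely for illustration, that the 40 words of a pair of languages fail to match independently and with the same probability (an idealisation that the second factor above contradicts), the observed fraction has the binomial standard error computed below; the function name is hypothetical.

```python
import math


def binomial_standard_error(fraction, n_words):
    """Standard error of an observed fraction based on n_words trials."""
    return math.sqrt(fraction * (1.0 - fraction) / n_words)


# Typical mismatch fraction of 97 % measured on a 40-word list
se = binomial_standard_error(0.97, 40)
print(f"{100 * se:.1f} percentage points")   # 2.7 percentage points
```

Since a single word on a 40-word list corresponds to 2.5 %, the empirical distances can only take a few discrete values near the peak, which by itself broadens the measured histogram.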

Finally, the largest languages F = Q = 7000 used up to now are still 
smaller than the actual numbers of words used in normal speech (without 
technical expressions and names). With a larger computer memory of 8 
Gigabytes, and weeks of continuous simulation time, up to 200,000 meanings 
and forms were simulated with p = 0.001, q = 0.1, f = s = 0.1, τ =
50, surpassing the 10^ words learned by adulthood [21]. Fig. 10a shows no
significant changes in the scaled position of the maximum for the yes-no
Hamming distance, when F = Q is increased (also for F = Q = 3000 and
65,000; not shown). Fig. 10b shows the whole distributions of the unscaled
yes-no Hamming distance at F = Q = 200, 000 for various times t; at the 
right end of the plot we see a single maximum at t = 10^, followed to its 
right by three peaks of decreasing height at t = 10^. Fig. 10 thus confirms 
for larger F = Q what was seen already in Figs. 8, 9, that for long times t 
a Hamming distance close to but below 100 percent is reached for the peak 
position. 

5 Summary 

Our model allowed, for the first time, the simulation of thousands of
meanings and forms. The resulting family size distribution was (depending on
definition) close to log-normal or close to the power law of reality [15][13], with
large fluctuations. The distribution of Hamming distances had a skewed and
realistic shape. With one of the definitions, strong differences, as desired, 
were found between the Hamming distances of languages within one family 
and between different families. 
Equilibration for F = Q = 500, split = f = 1 %
Figure 8: Time variation in the modified model of the average Hamming
(Manhattan) distance (top) and of the Hamming distance distribution (centre
and bottom) for small F = Q = 500.
t=1 million, p=0.01, q=0.9 (left) and 0.1 (right)
Figure 9: Comparison of the reality (+) of automated lexicostatistics [18][20]
with equilibrium simulations (yes-no; lines) for F = Q = 1000. Shorter
simulations with only 100,000 and 300,000 iterations are not significantly
different. Thus q < 0.1 roughly agrees with the lexicostatistical results.
F=Q=10,000(+), 100,000(x), 200,000(*)
Figure 10: Top: Position of the maximum in the distribution of the yes-no
Hamming distance, scaled by the largest possible distance F, as a function of
1/log(t), for three large values of F = Q. Note the flattening for t > 10^. The
bottom part shows the whole distributions for the largest F = Q = 200,000
(unscaled yes-no Hamming distance); sometimes there is a single peak,
sometimes there are several peaks close together.